Abstract Current statistical models for structured prediction make simplifying assumptions about the underlying output graph structure, such as assuming a low-order Markov chain, because exact inference becomes intractable as the tree-width of the underlying graph increases. Approximate inference algorithms, on the other hand, force one to trade off representational power against computational efficiency. In this paper, we propose two new types of probabilistic graphical models, large margin Boltzmann machines (LMBMs) and large margin sigmoid belief networks (LMSBNs), for structured prediction. LMSBNs in particular allow a very fast inference algorithm for arbitrary graph structures that runs in polynomial time with high probability. This probability is data-distribution dependent and is maximized in learning. The new approach overcomes the representation-efficiency trade-off in previous models and allows fast structured prediction with complicated graph structures. We present results from applying a fully connected model to multi-label scene classification and demonstrate that the proposed approach can yield significant performance gains over current state-of-the-art methods.
1 Introduction

Structured prediction is an important machine learning problem that occurs in many different fields, e.g., natural language processing, protein structure prediction and semantic image annotation. The goal is to learn a function that maps an input vector X to an output Y, where Y is a vector representing all the labels, whose components take on the value +1 or -1 (presence or absence of the corresponding label). The traditional approach to such multi-label classification problems is to train a set of binary classifiers independently. Structured prediction, on the other hand, also considers the relationships among the output variables Y. For example, in the image annotation problem, an entire image or parts of an image are annotated with labels representing an object, a scene or an event involving multiple objects (Carneiro et al. 2007). These labels are usually dependent on each other, e.g., buildings and beaches occur under the sky, a truck is a type of automobile, and sunsets are more likely to co-occur with beaches, sky, and trees (Figure 1). Such relations capture the semantics among the labels and play an important role in human cognition. A major advantage of structured prediction is that the structured representation of the output can be much more compact than an unstructured classifier, resulting in smaller sample complexity and greater generalization (Bengio et al. 2007).

Extending traditional classification techniques to structured prediction is difficult because of the potentially complicated inter-dependencies that may exist among the output variables. If the problem is modeled as a probabilistic graphical model, it is well-known that exact inference over a general graph is NP-hard. Therefore, practical approaches make simplifying assumptions about the dependencies among the output variables in order to simplify the graph structure and maintain tractability. Examples include maximum entropy Markov models (MEMMs) (McCallum et al. 2000), conditional random fields (CRFs) (Lafferty et al. 2001; Quattoni et al. 2004), max-margin Markov networks (M3Ns) (Taskar et al. 2004) and structured support vector machines (SSVMs) (Tsochantaridis et al. 2004). These approaches typically restrict the tree-width¹ of the graph so that the Viterbi algorithm or the junction tree algorithm can still be efficient.

On the other hand, there has been much research on fast approximate inference for complicated graphs based on, e.g., Markov chain Monte Carlo (MCMC), variational inference, or combinations of these methods. In general, MCMC is slow, particularly for graphs with strongly coupled variables. Good heuristics have been developed to speed up MCMC, but they are highly dependent on graph structure and associated parameters (Doucet et al. 2000). Variational inference is another popular approach where a complicated distribution over Y is approximated with a simpler distribution so as to trade accuracy for speed. For example, if the variables are assumed to be independent, one obtains the mean field algorithm. A Bethe energy formulation yields the loopy belief propagation (LBP) algorithm (Yedidia et al. 2005). If a combination of trees is considered, one obtains the tree-reweighted sum-product algorithm (Wainwright et al. 2005a). One can also relax the higher-order marginal constraints to obtain a linear programming algorithm (Wainwright et al. 2005b). The fewer the dependency constraints, the less accurate these inference algorithms become, and the faster they run. However, the sacrificed accuracy in inference could be detrimental to learning. For example, mean field can produce highly biased estimates, and loopy belief propagation might even cause the learning algorithm to diverge (Kulesza and Pereira 2007).

¹ In this paper, the tree-width of a directed acyclic graph refers to the tree-width of the corresponding undirected graph obtained through moralization.

Long-range dependencies and complicated graphs are 
necessary to accurately and precisely represent semantic knowl- 
edge. Unfortunately, the approaches discussed above all op- 
erate under the assumption that one cannot avoid the trade- 
off between the representational power and computational 
efficiency. 

In this paper, we propose large margin sigmoid belief networks (LMSBNs) and large margin Boltzmann machines (LMBMs), two new models for structured prediction. We provide a theoretical analysis tool to derive the generalization bounds for both of them. Most importantly, LMSBNs allow fast inference for arbitrarily complicated graph structures. Inference is based on a branch-and-bound (BB) technique that does not depend on the dependency structure of the graph and exhibits the interesting property that the better the fit of the model to the data, the faster the inference procedure.

Section 2 describes both LMSBNs and LMBMs. We present learning algorithms for both and the fast BB inference algorithm for LMSBNs. LMBMs, being undirected, rely on traditional inference algorithms.

Section 4 applies both LMSBNs and LMBMs to the semantic image annotation problem using a fully-connected graph structure. We empirically study the performance of the BB inference algorithm and illustrate its efficiency and effectiveness. We present results from experiments on a benchmark dataset which demonstrate that LMSBNs outperform current state-of-the-art methods for image annotation based on kernels and threshold-tuning.



2 Large Margin Sigmoid Belief Networks and Large 
Margin Boltzmann Machines 



The sigmoid belief network (SBN) (Neal 1992) and the Boltzmann machine (BM) (Hinton and Sejnowski 1983) are a special type of Bayesian network and a special type of Markov random field respectively, and are defined as follows:

Definition 1 A Boltzmann machine is an undirected graph $G = (V, E)$, where $V$ is the set of random variables with size $K = |V|$ and $E$ is the set of undirected edges. The joint likelihood is defined as:

$$\Pr(V \mid w) = \frac{1}{Z}\, e^{\frac{1}{2}\sum_i z_i} \qquad (1)$$

$$z_i = \sum_{j:(V_i,V_j)\in E} w_{ij} v_i v_j + w_i v_i$$

where $Z$ is the normalization constant.



Definition 2 A sigmoid belief network is a directed acyclic graph $G = (V, E)$, where $V$ is the set of random variables with size $K = |V|$ and $E$ is the set of directed edges. $(V_j, V_i)$ represents an edge from $V_j$ to $V_i$. For each node $V_i$, its parents are in the set $pa(V_i) = \{V_j \mid (V_j, V_i) \in E\}$. The joint likelihood is defined as:

$$\Pr(V \mid w) = \prod_{i=1}^{K} \Pr(V_i \mid pa(V_i), w) \qquad (2)$$

$$\Pr(V_i \mid pa(V_i), w) = \frac{1}{1 + e^{-z_i}}$$

$$z_i = \sum_{j: V_j \in pa(V_i)} w_{ij} v_i v_j + w_i v_i$$

In BMs, the edges are undirected, so the feature $v_i v_j$ appears in both $z_i$ and $z_j$. In SBNs, the edges are directed, so the feature $v_i v_j$ appears in either $z_i$ or $z_j$, but not both. One can generalize the function $z_i$ to utilize high order features over a set of variables. In probabilistic graphical models, this set is referred to as a clique. In SBNs or BMs, the features are defined as a product of all variables in the clique. For example, $C_1 = \{V_1, V_2, V_3\}$ is a 3rd order clique with $f_1 = v_1 v_2 v_3$. The edges are 2nd order cliques, e.g., $C_2 = \{V_1, V_2\}$, $f_2 = v_1 v_2$. The first order cliques are the variables themselves, e.g., $C_3 = \{V_1\}$, $f_3 = v_1$. When the variables take values in $\{-1, 1\}$, the feature function is also known as the parity function or the XOR function. Therefore, an SBN or BM softly encodes a Boolean function via an AND-of-XOR expansion², which provides a flexible way to encode human expert knowledge into the model. Without ambiguity, we simplify the representation of $z_i$ to $z_i = \sum_{j: V_i \in C_j} w_j f_j$, where the summation is taken over all cliques that include variable $V_i$. For SBNs, we require that all the variables in each clique $C_j$ other than $V_i$ must be parents of $V_i$. This requirement ensures that the underlying graph is acyclic and that each $C_j$ is used in one $z_i$.

In the structured prediction setting, the problem involves an input vector X, and the joint probability over all Y is conditioned on X, i.e., $\Pr(Y \mid X, w)$. Note that $z_i$ is defined for each $y_i$, although the cliques include both X and Y.

When there is only one output variable, i.e., $K = 1$, the conditional likelihoods of both SBNs and BMs become the same, i.e., $\Pr(Y = y \mid x; w) = \frac{1}{1 + e^{-z}}$, where $z = y \sum_j w_j \phi_j(x)$. The features are $f_j(x, y) = y\,\phi_j(x)$. This is the well-known logistic regression (LR) with a loss function $L(y, x, w) = \log(1 + e^{-z})$. In fact, an SBN can be considered as a product of LRs according to a topological order over the graph. The overall loss function is then $L(y, x, w) = \sum_i \log(1 + e^{-z_i})$. A BM needs normalization over all Y, so its loss function usually cannot be factorized locally, which makes learning more challenging.
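The factorization of the SBN loss into per-node logistic losses can be illustrated with a minimal Python sketch. It assumes pairwise features only and folds the input x into a per-node bias; the names `sbn_log_loss`, `sbn_joint_prob`, `weights`, `parents` and `bias` are hypothetical and do not come from the paper.

```python
import math


def sbn_log_loss(y, weights, parents, bias):
    """Sum of per-node logistic losses log(1 + exp(-z_i)) for labels in {-1, +1}.

    Assumption: z_i = y_i * (bias[i] + sum_j weights[(i, j)] * y[j]) over the
    parents j of node i, i.e. only first and second order cliques are used.
    """
    total = 0.0
    for i, y_i in enumerate(y):
        field = bias[i] + sum(weights[(i, j)] * y[j] for j in parents[i])
        z_i = y_i * field
        total += math.log1p(math.exp(-z_i))
    return total


def sbn_joint_prob(y, weights, parents, bias):
    """Joint likelihood: the product of the per-node sigmoids, exp(-loss)."""
    return math.exp(-sbn_log_loss(y, weights, parents, bias))
```

Because every node is locally normalized, summing `sbn_joint_prob` over all $2^K$ label vectors should give 1 without computing a partition function, which is the property a BM lacks.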

To facilitate the derivation of a fast inference algorithm for LMSBNs and a fast learning algorithm for LMBMs, we use a hinge loss, $[1 - z]_+ = \max(0, 1 - z)$, to approximate the log-loss $\log(1 + e^{-z})$. We call the resulting SBN a large margin sigmoid belief network (LMSBN) and the resulting BM a large margin Boltzmann machine (LMBM). The approximations are presented in Remark 3. The approximation of the LMBM is similar to the pseudo-likelihood approximation of a Markov random field. The only difference is the extra regularization. In a later section, we will show that this regularizer is crucial for LMBMs to generalize well. Note that for LMSBNs, each feature $f_j$ appears in only one $z_i$, but for LMBMs, each feature $f_j$ appears in all $z_i$ where $Y_i \in C_j$.
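How closely the hinge loss tracks the log-loss can be checked numerically. The sketch below is an illustration rather than part of the paper: it scans a grid of margins and reports the largest gap between $\log(1 + e^{-z})$ and $[1 - z]_+$. Any constant at least as large as this gap can serve as an additive offset in a bound of the form log-loss ≤ hinge + constant. The function names and grid parameters are assumptions.

```python
import math


def log_loss(z):
    """Logistic loss log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))


def hinge(z):
    """Hinge loss [1 - z]_+."""
    return max(0.0, 1.0 - z)


def max_gap(lo=-20.0, hi=20.0, steps=40001):
    """Largest value of log_loss(z) - hinge(z) over an evenly spaced grid."""
    best = float("-inf")
    for k in range(steps):
        z = lo + (hi - lo) * k / (steps - 1)
        best = max(best, log_loss(z) - hinge(z))
    return best
```

The gap grows for z < 1 and shrinks for z > 1, so it peaks at z = 1, where it equals $\log(1 + e^{-1})$.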



² This is different from the ring-sum expansion, which is an XOR-of-AND expansion.
Remark 3

$$L_{SBN}(y, x, w) \le L_{LMSBN}(y, x, w) + Kb$$

$$L_{BM}(y, x, w) \le L_{LMBM}(y, x, w) + Kb + g(x)\|w\|$$

$$L_{LMSBN}(y, x, w) = L_{LMBM}(y, x, w) = \sum_i [1 - z_i]_+ \qquad (3)$$

$$z_i = \sum_{j: Y_i \in C_j} w_j f_j$$

$$b = \log(e + e^{-1})$$



Proof From Figure 2, it is easy to verify that $\log(1 + e^{-z}) \le [1 - z]_+ + b$, which leads to the first upper bound for SBNs. For BMs, the features involving multiple variables $y_i$ appear in all of the corresponding $z_i$, which makes the upper bounding much harder. Here we prove the second upper bound as follows:



$$L_{BM}(y, x, w) \le \cdots \le \sum_i [1 - z_i]_+ + Kb + g(x)\|w\| \qquad (4)$$



since the hinge loss for $Y_1$ also contains $Y_2$, when the partition function marginalizes $Y_2$, we have to relax the summation with a term proportional to the norm of the weights whose corresponding cliques include both $Y_1$ and $Y_2$. This relaxation is represented by $g_1(x) \sum_{j: Y_1, Y_2 \in C_j} |w_j|$, where $g_1$ is a constant determined by $x$. After the whole partition function has been relaxed, the upper bound contains a regularizer on
all the weights whose corresponding cliques include at least 
two output Y . The set of all these cliques is □ 

Output values are predicted by minimizing the loss function, as shown in Equation 5 below. With an $\ell_2$ norm regularization on the weights $w$, the training problem for LMSBNs is defined as in Equation 6 below. Note that, for LMBMs, there is an extra $\ell_2$ regularization on the weights among the outputs Y, but no regularization on the weights for individual Y or between X and Y.

Fig. 2 Losses and upper bound (curves shown: log loss, hinge loss, hinge bound, 0-1 loss)



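$$\hat{y} = \arg\min_{y} L(y, x, w) \qquad (5)$$

$$\hat{w} = \arg\min_{w} \frac{1}{N}\sum_{l=1}^{N} L(y_l, x_l, w) + R(w) \qquad (6)$$

LMSBNs: $R(w) = \lambda\|w\|_2^2$
LMBMs: $R(w) = \lambda\|w\|_2^2 + \lambda\eta_0 \sum_{j:\, C_j \text{ contains at least two outputs}} w_j^2$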



2.1 Generalization Bound 

One major concern of structured prediction, as well as all classification problems, is generalization performance. Generalization performance for structured prediction has not been as well studied as for binary and multi-class classification (Taskar et al. 2004; Tsochantaridis et al. 2004; Daumé III et al. 2009). Both Taskar et al. and Tsochantaridis et al. employed the maximum-margin approach that builds on binary support vector machines (SVMs). Generalization performance can be addressed by an upper bound on the prediction errors. However, the derivation of the bound is specifically restricted to the loss function they use, and is hard to apply to other loss functions. Daumé III et al. consider a sequential decision approach that solves the structured prediction problem by making decisions one at a time. These sequential decisions are made multiple times, and the output is obtained by averaging all results. The generalization bound is analyzed in terms of all these binary classification losses. One major drawback of this approach is that the averaged losses for the averaged classifiers need a large number of iterations to converge. Even if it converges, the bound is still loose compared to the bound we present. We will discuss this further in Section 3.

In this section, we provide a general analysis tool for 
both single variable classification and structured prediction 
that allows arbitrary loss functions and holds tight. We first 
need the following threshold theorem: 
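Theorem 4 Assuming $(x, y) \sim \mathcal{D}$, if $\forall \hat{y} \ne y$, $\exists T > 0$ s.t. $L(x, y, w) \ge T$, then

$$\Pr(\hat{y} \ne y \mid w) \le \frac{1}{T} E_{\mathcal{D}}[L(x, y, w)] \qquad (7)$$

Proof

$$\Pr(\hat{y} \ne y \mid w) = E_{\mathcal{D}}[\mathbb{1}(\hat{y} \ne y)] \le E_{\mathcal{D}}[\Theta(L(x, y, w) - T)] \le \frac{1}{T} E_{\mathcal{D}}[L(x, y, w)]$$

The function $\mathbb{1}(z)$ is the indicator function that is 1 when $z$ is true and 0 when it is false, and $\Theta(z)$ is the Heaviside function that is 1 for $z \ge 0$ and 0 otherwise. The last inequality comes from the fact that $\Theta(z - T) \le z/T$ for $z \ge 0$.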

This threshold theorem allows one to discuss the prediction error bounds for any number of outputs with any loss function. For example, for logistic regression (LR), $L = \log(1 + e^{-z}) \ge \log 2$ whenever a mistake is made, so the threshold $T$ for LR is $\log 2$. In AdaBoost (Zhang 2001), $L = e^{-z} \ge 1$ whenever it makes a mistake, so $T = 1$. For SSVMs, $L = \max_{y' \ne y}[1 - (z(y, x, w) - z(y', x, w))]_+ \ge 1$ when it makes a mistake, so $T$ is again 1. Then the prediction errors for all these classifiers are upper bounded by the expected loss divided by the threshold $T$.
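The bound can be illustrated on a single binary output, where a mistake corresponds to a non-positive margin z. The sketch below is a hypothetical check, not taken from the paper: it computes the empirical error rate and the empirical mean loss divided by the threshold T for a list of margins. The helper names are assumptions.

```python
import math


def logistic_loss(z):
    """log(1 + exp(-z)); at least log 2 whenever z <= 0."""
    return math.log1p(math.exp(-z))


def hinge_loss(z):
    """[1 - z]_+; at least 1 whenever z <= 0."""
    return max(0.0, 1.0 - z)


def error_and_bound(margins, loss, threshold):
    """Return (empirical error rate, empirical mean loss / threshold).

    A prediction counts as a mistake when its margin is <= 0, so each
    mistake contributes at least `threshold` to the summed loss.
    """
    n = len(margins)
    error_rate = sum(1 for z in margins if z <= 0) / n
    bound = sum(loss(z) for z in margins) / n / threshold
    return error_rate, bound
```

For any list of margins, the first returned value should not exceed the second, with `threshold=math.log(2)` for the logistic loss and `threshold=1.0` for the hinge loss.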

According to this theorem, the goal of all classification tasks is to find the hypothesis that predicts with an expected loss as low as possible. On the other hand, for LMSBNs, there is a fast inference algorithm whose performance depends directly on this quantity. The smaller the expected loss, the faster the inference. For both of the above reasons, the log-loss and exponential-loss are unfavorable because they are usually larger than zero even if the model fits the data well. Therefore, we choose the hinge loss as the loss function for both LMSBNs and LMBMs.

The threshold for LMBMs is given in Remark 5 and the threshold for LMSBNs is given in Remark 6. For a tight bound, the threshold should be large enough, so for LMBMs, we need to constrain the weights among the output variables. In other words, if the coupling between outputs is stronger than the coupling between an output and an input, then the possibility of overfitting increases. This also explains why the approximate loss of LMBMs contains regularizations for the coupling weights among the output variables. However, for LMSBNs, the threshold is always 1. Generally speaking, LMSBNs can be expected to generalize better than LMBMs.



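Remark 5 For LMBMs, $T = \min_i [1 - g(x) \sum_{j:\, C_j \text{ contains at least two outputs}} |w_j|]_+$, for some $g$.

Proof For any $\hat{y}_i \ne y_i$, we have $[1 - z_i]_+ = [1 - A_0 - A_1]_+$ and $[1 - \hat{z}_i]_+ = [1 - A_0 - A_2]_+$, where $A_0 = \sum_{j: \hat{f}_j = f_j} w_j f_j$, $A_1 = \sum_{j: \hat{f}_j \ne f_j} w_j f_j$ and $A_2 = \sum_{j: \hat{f}_j \ne f_j} w_j \hat{f}_j$. Since all $y$ take values in $\{-1, +1\}$, $A_2 = -A_1$ and $[1 - \hat{z}_i]_+ = [1 - A_0 + A_1]_+$.

If $A_1 < 0$, we have $L \ge [1 - z_i]_+ \ge [1 - A_0]_+$. Otherwise, $L \ge \hat{L} \ge [1 - \hat{z}_i]_+ \ge [1 - A_0]_+$. So $L \ge [1 - \sum_{j: \hat{f}_j = f_j} w_j f_j]_+$. We can further loosen it to $L \ge \min_i [1 - g(x) \sum_{j:\, C_j \text{ contains at least two outputs}} |w_j|]_+$. □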



Remark 6 For LMSBNs, $T = 1$.

Proof Pick the first $y_i$ in the topological order that does not equal the optimal value, i.e., $\hat{y}_i \ne y_i$ and $\forall y_j \prec y_i$, $\hat{y}_j = y_j$. Let $L_i = [1 - z_i]_+$ and $\hat{L}_i = [1 - \hat{z}_i]_+$. Since Y takes values in $\{-1, 1\}$ and only $y_i \ne \hat{y}_i$ in $z_i$, it is easy to verify that $z_i = -\hat{z}_i$. So, we have $L_i = [1 + \hat{z}_i]_+$. If $\hat{z}_i > 0$, we have $L \ge L_i \ge 1$. Otherwise, $L \ge \hat{L} \ge \hat{L}_i \ge 1$. □

We assume all data are drawn from the same unknown distribution $\mathcal{D}$. Since $\mathcal{D}$ is unknown, one can only minimize the empirical risk rather than the expected risk. A fast convergence rate of the empirical objective to the expected one was proved in (Shalev-Shwartz et al. 2008) for the single output variable case. We can extend it to the general structured output case by providing a structured Rademacher complexity bound, as shown in Lemma 7.

Lemma? Lef ^ = {x,y i-)- L(x,y,w)}, 

■^i = {x,y Ij:i',GC, Wjfj}' W = [1 -2]+- have 



E 



sup (E/i - Eiv/?) 



< 



Proof 



E 



< E 



< 



sup (E/j - Ea,/i) 



E sup i£(/,:(x;,y;)-/.;(x,,y,)) 



Here $R_N$ is the Rademacher complexity (Bartlett and Mendelson 2003) for sample size $N$. See Bartlett and Mendelson (2003) for details on the notation.



□ 



Together with Lemma 7 and Corollary 4 in (Shalev-Shwartz et al. 2008), we can now derive a generalization bound as in Theorem 8.

Theorem 8 Ler^(w) =E@[L(x,y,w)], w,, ~ arginf„ (w). 
Assuming Y-jfj < B^> far any 5 > 0, with probability 1 — 5 
over the sample size $N$, if $\lambda = c$ where $c$ is a constant, we have



Pr(y ^ y|w) < -if (w) 



=^(w) <^(w„) + B|lw 



The basic idea of the structured Rademacher complexity is to bound the whole functional space by a combination of the Rademacher complexities of the individual subspaces. For LMBMs, a feature $f_j$ is shared by all $z_i$ where $Y_i \in C_j$. So the subspaces overlap with each other, and the overall Rademacher complexity counts the features multiple times while $B$ counts them only once. Therefore the generalization bound is loosened by $\sqrt{d}$, where $d$ is the maximum clique size. The more complicated the graph, the larger the $d$. For LMSBNs, each feature only appears in one subspace, so $d$ is always 1. Hence the bound for LMSBNs is tighter than for LMBMs.

Furthermore, the bound given above is better than the PAC-Bayes bound of SSVMs and is not affected by the inference algorithm. For SSVMs, when there is no cheap exact inference algorithm available, the PAC-Bayes bound becomes worse due to the extra degrees of freedom introduced by relaxations (Kulesza and Pereira 2007), leading to potentially poorer generalization performance.



2.2 Learning Algorithm 

For LMSBNs, the learning problem defined in Equation 6 can be decomposed into $K$ independent optimization problems³. Each of them can be solved efficiently by any of the modern fast solvers such as the dual coordinate descent algorithm (DCD) (Hsieh et al. 2008), the primal stochastic gradient descent algorithm (PEGASOS) (Shalev-Shwartz et al. 2007; Bottou and Bousquet 2008) or the exponentiated gradient descent algorithm (Collins et al. 2008). For LMBMs, the weights are shared among multiple $z_i$, so one has to optimize the whole objective simultaneously. Similar to (Hsieh et al. 2008), we give a dual coordinate descent based optimization algorithm for LMBMs.

Consider the following primal optimization problem: 

min -T^ 'njW^i + -J— V £■/ 

subject to ^jfji ^ 1 - ^ii 

j-YieCj 

in > 

where $\eta_j = 1$ if $w_j$ is not extra regularized; otherwise, $\eta_j = 1 + \eta_0$. The index $l$ denotes a training instance. Let $\alpha_{il}$ and $\beta_{il}$ be Lagrange multipliers. Then, we have the Lagrangian:

³ $\lambda$ should be the same, otherwise Theorem 8 does not hold.

ii j-.Y/eCj 
We optimize L with respect to w and ^ : 

^. = ^j^.i-L L «,//,/ =0 

"^^^J I tY,eCj 

dL. 1 

Substituting for w and ^ , we have the dual objective: 

'L'" = ^ E L E «//«/'/'G;7/'-E0=f 

j.l,l'i:YieCji':Y.,eCi il 

where $Q_{jll'} = f_{jl} f_{jl'} / \eta_j$. The dual coordinate descent algorithm picks one $\alpha_{il}$ at a time and optimizes the dual Lagrangian with respect to this variable. The resulting algorithm is described in Algorithm 1.



Algorithm 1 The dual coordinate descent algorithm for 

large margin Boltzmann machines 

Input: {/,,}, {e,7,}, A, W 
Output: w 

1: 
2 
3: 
4: 
5: 



a <— 0,w<— 
while a is not optimal do 
for all a,; do 

a,, ^ a,7 

G = Eyii^ec, Wjfji - 1 

r min(G,0) a„ = 0, 

PG= I max(G,0) ^, 
I G < «„ < ^ 

if \PG\ 7^ then 
a,; <— min{max(ao 



<- Wj + (a,7 - a„)fji,\/j : Yi 6 Cj 



return w 
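The structure of Algorithm 1 (gradient G, projected gradient PG, clipped update of one dual variable, incremental update of w) follows the dual coordinate descent scheme of Hsieh et al. for hinge-loss linear models. The sketch below implements that generic scheme for a plain problem of the form min ½‖w‖² + C Σ_l [1 − w·f_l]_+ and is not the LMBM-specific algorithm: it has no per-weight scaling $\eta_j$ and no sharing of weights across several $z_i$. The function name, the box bound `C` and the stopping rule are assumptions.

```python
def dual_coordinate_descent(features, C=1.0, max_epochs=1000, tol=1e-8):
    """Dual coordinate descent for min 0.5*||w||^2 + C * sum_l [1 - w.f_l]_+.

    `features[l]` is assumed to already include the label sign, so the
    margin of instance l is simply the dot product w . features[l].
    """
    n, d = len(features), len(features[0])
    alpha = [0.0] * n
    w = [0.0] * d
    q_diag = [sum(v * v for v in f) for f in features]
    for _ in range(max_epochs):
        max_pg = 0.0
        for l, f in enumerate(features):
            if q_diag[l] == 0.0:
                continue
            # Gradient of the dual objective with respect to alpha[l].
            g = sum(wj * fj for wj, fj in zip(w, f)) - 1.0
            # Projected gradient respects the box 0 <= alpha[l] <= C.
            if alpha[l] <= 0.0:
                pg = min(g, 0.0)
            elif alpha[l] >= C:
                pg = max(g, 0.0)
            else:
                pg = g
            max_pg = max(max_pg, abs(pg))
            if pg != 0.0:
                old = alpha[l]
                alpha[l] = min(max(old - g / q_diag[l], 0.0), C)
                delta = alpha[l] - old
                for j in range(d):
                    w[j] += delta * f[j]
        if max_pg < tol:
            break
    return w, alpha
```

Maintaining w incrementally keeps each coordinate update linear in the number of features, which is the main reason this family of solvers is fast on sparse data.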



2.3 Inference Algorithm 

In this section, we propose a simple and efficient inference algorithm (Algorithm 2) to solve the prediction problem in Equation 5 for LMSBNs. According to the topological order of the graph, we branch on each $Y_i$, and compute $z_i$ with $x$ and all of its parents $y_j$. We first try the value of $y_i$ that makes $z_i > 0$, i.e., the left branch in the algorithm, then the right branch with the opposite value of $y_i$. During this search, we
ever the current objective is higher than the upper bound, we 
backtrack to the previous variable. The search terminates be- 
fore states of Y have been visited. The following theo- 
rem shows that with a high probability, the above algorithm 
computes the optimal values in polynomial time: 
Algorithm 2 The Branch and Bound Algorithm for Inference

Input: $x$, $w$, $S > 1$
Output: $\hat{y}$, $UB$

$UB = S$, $i = 0$, $U_0 = 0$;
while $i \ge 0$ do
    if $i = K$ then
        if $U_K < UB$ then
            $UB = U_K$, $\hat{y} = y$;
        $i = i - 1$;
    else
        if Left branch has not been tried then
            $y_i = \arg\max_{y_i} z_i$, $U_{i+1} = U_i + [1 - z_i]_+$;
        else if Right branch has not been tried then
            $y_i = -y_i$, $U_{i+1} = U_i + [1 + z_i]_+$;
        if $U_{i+1} > UB$ or both branches have been tried then
            $i = i - 1$;
        else
            $i = i + 1$;

Theorem 9 For any $S > 1$, the BB algorithm reaches the optimal values before $O(K^S)$ states are visited with a probability of at least $1 - \frac{1}{S}\mathcal{L}(w)$, where $\mathcal{L}(w) = E_{\mathcal{D}}[L(x, y, w)]$ is the expected loss.

Proof During the search, if we branch on the right, the hinge loss $[1 + z_i]_+$ is greater than 1. So, for a given $x$, if the true objective $L < S$, the optimal objective $\hat{L} \le L < S$ as well, and the optimal path contains at most $S$ right branches. Since the BB algorithm always searches the left branch first, the optimal path will be reached before $\sum_{i=0}^{S-1} \binom{K}{i} \le O(K^S)$ states have been searched. According to the Markov inequality, $\Pr(L < S) = 1 - \Pr(L \ge S) \ge 1 - \frac{1}{S}\mathcal{L}(w)$.

The BB algorithm adjusts the search tree according to the model weights. Through training, optimal paths are condensed to the low energy side, i.e., the left side of the search tree, with a high probability. This probability is directly related to the expected loss with respect to the given data distribution. We therefore label the BB algorithm a data-dependent inference algorithm. Most popular inference algorithms for exact or approximate inference depend on graph complexity: the more complicated the graph, the slower the inference. This trade-off diminishes the applicability of these algorithms and presents researchers with the difficult problem of selecting a (possibly sub-optimal) graph structure that balances accuracy and efficiency. The BB algorithm for LMSBNs circumvents this trade-off and allows arbitrarily complicated graphs without sacrificing computational efficiency. In fact, if a particular complicated graph yields a smaller expected loss, the BB algorithm in turn runs even faster.

It is well-known that for NP-hard problems, there may be many instances that can be solved efficiently. The area of speedup learning focuses on learning good heuristics to speed up problem solvers. The approach presented here can be regarded as a novel method for speedup learning (Tadepalli and Natarajan 1996) and demonstrates that the experience gained during training can speed up a problem solver significantly.

The BB algorithm is specifically designed for LMSBNs, a directed graphical model. For undirected models, the BB algorithm does not guarantee a polynomial time complexity with a high probability. Indeed, we observe an exponential time complexity when it is applied to LMBMs. For the undirected models including SSVMs and LMBMs, we implement a convex relaxation-based linear programming (LP) algorithm (Wainwright et al. 2005b). Note that although LMBMs do not have a fast inference algorithm, unlike SSVMs, the learning is not affected by the inference algorithm. In the experiments section, we will show that LMBMs outperform SSVMs.

The BB algorithm differs from other search-based decoding algorithms, e.g., beam search and best first search (Abdou and Scordilis 2004), in several aspects. First, those search algorithms typically prune the supports of maximum cliques that can grow exponentially. On one hand, the pruning can lead to misclassification quickly if backtracking is not implemented. On the other hand, the number of remaining states might still be large so that the inference is still slow. Furthermore, even if a backtracking procedure is implemented, unlike the BB algorithm for LMSBNs, there are still no guaranteed heuristics that can prune the states efficiently and correctly.

To demonstrate the efficiency and the data dependency property, we run the algorithms on the test data of RCV1-V2 (a text categorization dataset) with a trained model and a random untrained model. The running times are collected by varying the number of output variables. The CPU time is measured on a 2.8 GHz Pentium 4 desktop computer⁴.

⁴ The speed measurement of LP is comparable to that of Finley and Joachims (2008). According to their experiments, graph cut and loopy belief propagation can perform 10-100 times faster, but slower than BB.

The upper graph in Figure 3 demonstrates that the BB algorithm performs several orders of magnitude faster than LP. In this experiment, $S$ is set to a very large value such that the solution from BB is guaranteed to be the optimal solution. The running time of LP with respect to the number of output variables does not vary from a trained model to a random untrained model, but the running time of BB changes significantly. For the random untrained model, the BB algorithm demonstrates an exponential time complexity with respect to the number of output variables. However, after training, the running time of the BB algorithm scales up much more slowly.

Fig. 3 Upper graph (inference speed comparison; x-axis: number of labels; curves: BB for a random model, BB for a trained model, LP for a random model, LP for a trained model). Running time comparisons of LP and BB algorithms on the test dataset of RCV1-V2. The dashed lines are 1 standard deviation above the mean. The time axis is log-scaled. Lower graph. Accuracy and data distribution dependency. The more the training data, the more accurate and faster the prediction. The corresponding estimated theoretical lower bounds are plotted in blue.

This observation underscores the data distribution dependent property of BB, i.e., the better the model fits the data, the faster BB performs. We illustrate this property further by a second experiment. In this experiment, the probability of the BB algorithm reaching the optimal values is plotted by varying the cutoff threshold $S$. According to Theorem 9, $S$ reflects the running time overhead for the BB algorithm. We compare this curve for several models, namely, a random model and models trained with 10, 100, 1000, and 3000 training instances respectively. The lower graph in Figure 3 shows a significant improvement for the trained models over the random untrained model. Moreover, with more and more training instances, more and more test instances can be predicted exactly and quickly. In the same figure, we also plot the corresponding theoretical lower bounds estimated from the testing dataset (blue lines). The lower graph of Figure 3 verifies Theorem 9 empirically.

Due to the fast and accurate inference algorithm for LMSBNs, we can start with the most complicated graph structure, i.e., a fully connected model. The linear form of $z_i$ can be generalized to high order features. Moreover, the kernel trick can be applied to augment the modeling power. The only concern is to minimize the expected loss as much as possible, because a small expected loss guarantees not only a high prediction accuracy but also a fast inference.
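The branch-and-bound search described in this section can be sketched in a few lines of Python. This is an illustrative recursive variant under stated assumptions, not a transcription of Algorithm 2: the model is fully connected with pairwise features only, the input x is folded into a per-node bias `b`, node order is the topological order, and a branch is pruned as soon as its partial loss reaches the current upper bound. The names `field`, `total_loss`, `branch_and_bound` and `brute_force_min` are hypothetical.

```python
from itertools import product


def hinge(z):
    """Hinge loss [1 - z]_+."""
    return max(0.0, 1.0 - z)


def field(i, y, W, b):
    """Local field of node i given its already assigned parents 0..i-1."""
    return b[i] + sum(W[i][j] * y[j] for j in range(i))


def total_loss(y, W, b):
    """L(y) = sum_i [1 - z_i]_+ with z_i = y_i * field_i."""
    return sum(hinge(y[i] * field(i, y, W, b)) for i in range(len(y)))


def branch_and_bound(W, b, S=float("inf")):
    """Depth-first search that tries the branch with z_i >= 0 first.

    Returns (best labels, best loss, number of visited states). The labels
    are None if no assignment has a loss below the initial bound S.
    """
    K = len(b)
    best = {"loss": S, "y": None, "visited": 0}
    y = [0] * K

    def search(i, partial):
        best["visited"] += 1
        if i == K:
            if partial < best["loss"]:
                best["loss"], best["y"] = partial, list(y)
            return
        a = field(i, y, W, b)
        left = 1 if a >= 0 else -1          # left branch: z_i >= 0
        for y_i in (left, -left):           # then the right branch
            y[i] = y_i
            cost = partial + hinge(y_i * a)
            if cost < best["loss"]:         # prune against the upper bound
                search(i + 1, cost)

    search(0, 0.0)
    return best["y"], best["loss"], best["visited"]


def brute_force_min(W, b):
    """Reference answer by enumerating all 2^K label vectors."""
    return min(total_loss(list(y), W, b)
               for y in product((-1, 1), repeat=len(b)))
```

With the default `S`, the search is exact and can be compared against `brute_force_min` on small models. The visited count gives a direct way to observe the data-dependent behavior discussed above, since weights that make most left branches cheap should lead to far fewer visited states.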



3 Related Models 



Most maximum margin estimated structured prediction models, e.g., SSVMs (Tsochantaridis et al. 2004), maximum margin Markov networks (M3Ns) (Taskar et al. 2004), maximum margin Bayesian networks (M2BNs) (Guo et al. 2005) and conditional graphical models (CGMs) (Perez-Cruz et al. 2007), adopt a min-max formulation as shown below:

1„ „9 1 



XN 



Et^ (y/,y)-m(x/,y,;w)]+(8) 



»j(x/,y/;w) 



max'f^(x,,y,;w) 



f(x/,y;w) 



where f is the compatibility function derived from a prob- 
abilistic model, and m is the margin function. 

The embedded maximization operation potentially induces an exponential number of constraints. This exponential number of constraints makes optimization intractable. In M2BNs, the local normalization constraints make the problem even harder. SSVMs utilize a cutting plane algorithm (Joachims et al. 2009) to select only a small set of constraints. M3Ns directly treat the dual variables as the decomposable pseudo-marginals. When the undirected graph is of low tree-width, both SSVMs and M3Ns are computationally efficient and generalize well. However, for high tree-width, approximate inference has to be used and both the computational complexity and the sample complexity increase significantly (Kulesza and Pereira 2007; Finley and Joachims 2008).



CGMs decompose the single hinge loss into a summation of several hinge losses, each corresponding to one feature function, such that the exponential number of combinations is greatly reduced. The decomposition from one hinge loss to multiple hinge losses is similar to LMBMs and LMSBNs. However, CGMs decompose to each feature function. For real problems, not every feature function is compatible with the data, which leads to a large and trivial upper bound. Therefore, the performance cannot be guaranteed.

The large margin estimation by the threshold theorem
generalizes the maximum margin estimation approach. As
long as the loss function satisfies the threshold theorem, there 
is a margin function implicitly defined such that minimiz- 
ing the expected loss maximizes the margins. The traditional 
log-loss based models, e.g., CRFs and MEMMs, can be dis- 
cussed under the large margin estimation framework, but 
the thresholds are possibly small so that the upper bounds 
become trivial. This suggests that large margin estimated 
models could generalize better than maximum likelihood es- 
timated models. 

For problems like semantic annotation, a low-treewidth graph
usually is insufficient to represent the knowledge about the
relationships among the labels. The example in Figure 1
illustrates the motivation for a high-treewidth graph. All of
the models discussed above lack a fast and accurate inference
algorithm for high-treewidth graphs, and are subject to the
trade-off between the treewidth and computational efficiency.

To speed up inference for high-treewidth graphical models,
one can use mixture models to represent probabilities. For
example, MoP-MEMMs (Rosenberg et al., 2007) extend MEMMs to
address long-range dependencies and represent the conditional
probability by a mixture model. Wainwright et al. use a
mixture of trees to approximate a Markov random field. Both
demonstrate performance gains, but one still has to improve
inference speed by restricting the number of mixtures.

Another line of research for high-treewidth graphical models
uses arithmetic circuits (ACs) (Darwiche, 2000) to represent
the Bayesian networks. AC inference is linear in the circuit
size, so inference is fast as long as the circuit is small.
But learning the optimal AC is an NP-hard problem. Similarly,
one has to improve inference speed by penalizing the circuit
size (Lowd and Domingos, 2008).

Search-based structured prediction (SEARN) (Daume III et al.,
2009) takes a different approach than probabilistic graphical
models to handle high tree-width graphs. It solves the
structured prediction problem by making decisions
sequentially. A later classifier can take all the earlier
decisions as inputs, which is similar to LMSBNs. In fact, the
inference can be considered as the initial decision of the BB
algorithm. The expected errors caused by this naive inference
could be very high. SEARN implements an averaging approach to
reduce the expected errors. It trains a set of sequential
classifiers in each iteration and outputs the prediction by
averaging the decisions made over all iterations. The earlier
decisions are fed into later classifiers, so the later
classifiers possibly make fewer mistakes. By averaging over
iterations, the expected loss is reduced. Roughly speaking,
the prediction errors will be bounded by this averaged
expected loss (the expectation is over the unknown data
distribution, while the averaging is over the iterations)
multiplied by a logarithmic factor (supposing that the initial
policy can make perfect predictions). Compared to the bounds
of LMBMs and LMSBNs, where the prediction errors are bounded
by the minimum expected loss divided by the threshold T, the
generalization bound of SEARN is rather loose. Furthermore,
according to (Daume III et al., 2009), one needs a large
number of iterations to reach that bound, which slows down the
inference. Therefore, one still has to limit the number of
iterations for a faster inference, which might sacrifice the
prediction accuracy.

Unlike all the above approaches, LMSBNs possess a very
interesting property: one does not have any constraints on the
modeling power. The smaller the expected loss, the faster the
inference. Usually, one obtains a smaller expected loss by
using a more complicated graph. This property leads to a novel
approach for structured prediction with high tree-width
graphs.



4 Experiments



The performance of LMSBNs was tested on a scene annotation
problem based on the Scene dataset (Boutell et al., 2004). The
dataset contains 1211 training instances and 1196 test
instances. Each image is represented by a 294-dimensional
color profile feature vector (based on a CIE LUV-like color
space). The output can be any combination of 6 possible scene
classes (beach, sunset, fall foliage, field, urban, and
mountain).

We compare a fully connected LMSBN with three other methods:
binary classifiers (BCs), SSVMs (Finley and Joachims, 2008),
and threshold selected binary classifiers (TSBCs) (Fan and
Lin, 2007). BCs train one classifier for each label and
predict independently. For SSVMs, we follow (Finley and
Joachims, 2008) to implement a fully connected undirected
model with [...] features. We implement a convex
relaxation-based linear programming algorithm for inference,
since in both (Finley and Joachims, 2008) and (Kulesza and
Pereira, 2007), the convex relaxation-based approximate
inference algorithm was shown to outperform other approximate
inference algorithms such as loopy belief propagation and
graph cuts (Kolmogorov and Zabih, 2002). TSBCs iteratively
tune the optimal decision threshold for each classifier to
increase the overall performance with respect to a certain
measure, e.g., exact match ratio and F-scores. Many labels in
the multi-label datasets are highly unbalanced, leading to
classifiers that are biased. TSBCs can effectively adjust the
classifier's precision and recall to achieve state-of-the-art
performance. In our comparisons, we borrow the best results
from (Fan and Lin, 2007) directly.

We implemented two BCs, a linear BC (BCl) and a kernelized BC
(BCk), and three LMSBNs: (1) LMSBNlo is trained with the
default order, i.e., ascending along the label indices; (2)
LMSBNlf is trained with the order selected according to the
F-scores of the BC: we sort the variables according to their
F-scores, and the higher the F-score, the smaller the index in
the order; (3) LMSBNkf is a kernelized model with the same
order as LMSBNlf. We also implemented two SSVMs: (1) SSVMhmm
is trained by using a first-order Markov chain. It is
different from the SSVM^hmm package, which does not consider
all inputs X for each Yj. The inference algorithm for SSVMhmm
is the Viterbi algorithm; (2) SSVMfull is trained by using a
fully connected graph.

We consider three categories of performance measures. The
first consists of instance-based measures and includes the
exact match ratio (E) (Equation 9) and the instance-based
F-score (Fsam) (Equation 11). The second consists of
label-based measures and includes the Hamming loss (H)
(Equation 10) and the macro-F score (Fmac) (Equation 12). The
last is a mixed measure, the micro-F score (Fmic) (Equation
13). Fsam calculates the F-score for each instance, and
averages over all instances. Fmac calculates the F-score for
each label, and averages over all labels. Fmic calculates the
F-score for the entire dataset.



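E = \frac{1}{N} \sum_{l=1}^{N} 1(\hat{y}_l = y_l)    (9)

H = \frac{1}{NK} \sum_{l=1}^{N} \sum_{i=1}^{K} 1(\hat{y}_{li} \neq y_{li})    (10)

F_{sam} = \frac{1}{N} \sum_{l=1}^{N} \frac{2 \sum_{i} 1(y_{li} = \hat{y}_{li} = 1)}{\sum_{i} \big( 1(y_{li} = 1) + 1(\hat{y}_{li} = 1) \big)}    (11)

F_{mac} = \frac{1}{K} \sum_{i=1}^{K} \frac{2 \sum_{l} 1(y_{li} = \hat{y}_{li} = 1)}{\sum_{l} \big( 1(y_{li} = 1) + 1(\hat{y}_{li} = 1) \big)}    (12)

F_{mic} = \frac{2 \sum_{l} \sum_{i} 1(y_{li} = \hat{y}_{li} = 1)}{\sum_{l} \sum_{i} \big( 1(y_{li} = 1) + 1(\hat{y}_{li} = 1) \big)}    (13)

where N is the number of instances, K is the number of labels,
y_{li} is the i-th label of instance l, and \hat{y}_{li} is
its prediction.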



The instance-based measure is more informative if the 
correct prediction of co-occurrences of labels is important; 
the label-based measure is more informative if the correct 
prediction of each label is deemed important. 
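
As an illustration of how the five measures differ in what they average over, the following is a minimal NumPy sketch. It assumes label matrices of shape (N, K) with entries in {-1, +1}, and it assumes the convention that an F-score with an empty true set and an empty predicted set equals 1. That convention, and the function name multilabel_scores, are choices made here and are not taken from the paper.

```python
import numpy as np

def multilabel_scores(Y_true, Y_pred):
    """Exact match, Hamming loss, and sample/macro/micro F-scores.

    Y_true, Y_pred: integer arrays of shape (N, K) with entries in {-1, +1}.
    """
    T = (Y_true == 1)          # true positives mask per (instance, label)
    P = (Y_pred == 1)          # predicted positives mask
    both = T & P               # correctly predicted positives

    def f_ratio(num, den):
        # 2 * |T and P| / (|T| + |P|); assumed to be 1 when both sets are empty
        num = np.asarray(num, dtype=float)
        den = np.asarray(den, dtype=float)
        return np.where(den > 0, 2.0 * num / np.maximum(den, 1.0), 1.0)

    return {
        "E": float(np.mean(np.all(Y_true == Y_pred, axis=1))),
        "H": float(np.mean(Y_true != Y_pred)),
        # per-instance F-score, averaged over instances
        "Fsam": float(np.mean(f_ratio(both.sum(axis=1), T.sum(axis=1) + P.sum(axis=1)))),
        # per-label F-score, averaged over labels
        "Fmac": float(np.mean(f_ratio(both.sum(axis=0), T.sum(axis=0) + P.sum(axis=0)))),
        # one F-score pooled over the whole dataset
        "Fmic": float(f_ratio(both.sum(), T.sum() + P.sum())),
    }
```

Only the axis of the inner sums changes between Fsam, Fmac and Fmic, which is why a rare label that is always missed can pull Fmac down sharply while barely moving Fmic.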

The results are shown in Figure 4. LMSBNkf consistently
performs the best on all measures. Even the LMSBN models
without kernels outperform TSBC on instance-based measures.

SSVMhmm performs better than BCl, but worse than SSVMfull,
as expected. The inference speed of BCl is faster than that of
SSVMhmm, which in turn is faster than that of SSVMfull. This
demonstrates the trade-off between modeling power and
efficiency.

With the help of kernels, LMSBNkf further outperforms 
the TSBC on all measures. LMSBNs as proposed in this 
paper are geared towards minimizing 0-1 errors. Threshold 
tuning is particularly effective in the case of highly unbal- 
anced labels. An interesting line of research is combining 
LMSBNs with threshold tuning to further improve the per- 
formance. 
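
For readers unfamiliar with threshold tuning, the following is a generic Python sketch of the idea for a single label. It is not the procedure of Fan and Lin, and it is not part of the LMSBN method. It assumes real-valued classifier scores on held-out data, tries each observed score as a decision threshold, and keeps the one with the highest F-score. The function names f1 and tune_threshold are hypothetical.

```python
import numpy as np

def f1(y_true, y_pred):
    """F-score for one label with values in {-1, +1}; assumed 1 if no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    denom = np.sum(y_true == 1) + np.sum(y_pred == 1)
    return 2.0 * tp / denom if denom > 0 else 1.0

def tune_threshold(scores, y_true):
    """Return (threshold, F-score) maximizing F1 when predicting +1 iff score >= threshold."""
    # Start from the default decision threshold of zero.
    best_t = 0.0
    best_f = f1(y_true, np.where(scores >= 0.0, 1, -1))
    # Every distinct observed score is a candidate threshold.
    for t in np.unique(scores):
        f = f1(y_true, np.where(scores >= t, 1, -1))
        if f > best_f:
            best_t, best_f = float(t), f
    return best_t, best_f
```

On an unbalanced label, lowering the threshold below zero trades precision for recall, which is the adjustment the text above refers to.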



5 Conclusions 

This paper proposes the use of large margin graphical models
for fast structured prediction in images with complicated
graph structures. A major advantage of the proposed approach
is the existence of fast training and inference algorithms,
which open the door to tackling very large-scale image
annotation problems. Unlike previous inference algorithms for
structured prediction, the proposed BB inference algorithm
does not sacrifice representational power for speed, thereby
allowing complicated graph structures to be modeled. Such
complicated graph structures are essential for accurate
semantic modeling and labeling of images. Our experimental
results demonstrate that the new approach outperforms current
state-of-the-art approaches. Future research will focus on
applying the framework to annotating parts of images with
their spatial relationships, and enhancing the
representational power of the model by introducing hidden
variables.